{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLP Pipeline with spaCy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[spaCy](https://spacy.io/) is a widely used python library with a comprehensive feature set for fast text processing in multiple languages. \n",
    "\n",
    "The usage of the tokenization and annotation engines requires the installation of language models. The features we will use in this chapter only require the small models, the larger models also include word vectors that we will cover in chapter 15."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![spaCy](assets/spacy.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:24.649861Z",
     "start_time": "2020-06-20T17:06:24.648019Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:25.286434Z",
     "start_time": "2020-06-20T17:06:24.651299Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "import spacy\n",
    "from spacy import displacy\n",
    "from textacy.extract import ngrams, entities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### SpaCy Language Model Installation\n",
    "\n",
    "In addition to the `spaCy` library, we need [language models](https://spacy.io/usage/models)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "#### English"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only need to run once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:27.023620Z",
     "start_time": "2020-06-20T17:06:25.288146Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: en_core_web_sm==2.3.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.0/en_core_web_sm-2.3.0.tar.gz#egg=en_core_web_sm==2.3.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (2.3.0)\n",
      "Requirement already satisfied: spacy<2.4.0,>=2.3.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from en_core_web_sm==2.3.0) (2.3.0)\n",
      "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (3.0.2)\n",
      "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (2.24.0)\n",
      "Requirement already satisfied: setuptools in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (47.3.1.post20200616)\n",
      "Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (0.6.0)\n",
      "Requirement already satisfied: plac<1.2.0,>=0.9.6 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (0.9.6)\n",
      "Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (1.0.2)\n",
      "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (1.0.2)\n",
      "Requirement already satisfied: numpy>=1.15.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (1.18.5)\n",
      "Requirement already satisfied: thinc==7.4.1 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (7.4.1)\n",
      "Requirement already satisfied: blis<0.5.0,>=0.4.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (0.4.1)\n",
      "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (4.46.1)\n",
      "Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (1.0.0)\n",
      "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (2.0.3)\n",
      "Requirement already satisfied: chardet<4,>=3.0.2 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (3.0.4)\n",
      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (1.25.9)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (2020.4.5.2)\n",
      "Requirement already satisfied: idna<3,>=2.5 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (2.9)\n",
      "Requirement already satisfied: importlib-metadata>=0.20; python_version < \"3.8\" in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (1.6.1)\n",
      "Requirement already satisfied: zipp>=0.5 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < \"3.8\"->catalogue<1.1.0,>=0.0.7->spacy<2.4.0,>=2.3.0->en_core_web_sm==2.3.0) (3.1.0)\n",
      "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
      "You can now load the model via spacy.load('en_core_web_sm')\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python -m spacy download en_core_web_sm\n",
    "\n",
    "# more comprehensive models:\n",
    "# {sys.executable} -m spacy download en_core_web_md\n",
    "# {sys.executable} -m spacy download en_core_web_lg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Spanish"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "[Spanish language models](https://spacy.io/models/es#es_core_news_sm) trained on [AnCora Corpus](http://clic.ub.edu/corpus/) and [WikiNER](http://schwa.org/projects/resources/wiki/Wikiner)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only need to run once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:28.725394Z",
     "start_time": "2020-06-20T17:06:27.024637Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: es_core_news_sm==2.3.0 from https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-2.3.0/es_core_news_sm-2.3.0.tar.gz#egg=es_core_news_sm==2.3.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (2.3.0)\n",
      "Requirement already satisfied: spacy<2.4.0,>=2.3.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from es_core_news_sm==2.3.0) (2.3.0)\n",
      "Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (1.0.2)\n",
      "Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (0.6.0)\n",
      "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (2.24.0)\n",
      "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (3.0.2)\n",
      "Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (1.0.0)\n",
      "Requirement already satisfied: setuptools in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (47.3.1.post20200616)\n",
      "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (2.0.3)\n",
      "Requirement already satisfied: numpy>=1.15.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (1.18.5)\n",
      "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (1.0.2)\n",
      "Requirement already satisfied: plac<1.2.0,>=0.9.6 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (0.9.6)\n",
      "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (4.46.1)\n",
      "Requirement already satisfied: thinc==7.4.1 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (7.4.1)\n",
      "Requirement already satisfied: blis<0.5.0,>=0.4.0 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (0.4.1)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (2020.4.5.2)\n",
      "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (1.25.9)\n",
      "Requirement already satisfied: idna<3,>=2.5 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (2.9)\n",
      "Requirement already satisfied: chardet<4,>=3.0.2 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (3.0.4)\n",
      "Requirement already satisfied: importlib-metadata>=0.20; python_version < \"3.8\" in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (1.6.1)\n",
      "Requirement already satisfied: zipp>=0.5 in /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < \"3.8\"->catalogue<1.1.0,>=0.0.7->spacy<2.4.0,>=2.3.0->es_core_news_sm==2.3.0) (3.1.0)\n",
      "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
      "You can now load the model via spacy.load('es_core_news_sm')\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python -m spacy download es_core_news_sm\n",
    "\n",
    "# more comprehensive model:\n",
    "# {sys.executable} -m spacy download es_core_news_md"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create shortcut names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:29.433123Z",
     "start_time": "2020-06-20T17:06:28.726471Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[38;5;2m✔ Linking successful\u001b[0m\n",
      "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/en_core_web_sm\n",
      "-->\n",
      "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/spacy/data/en\n",
      "You can now load the model via spacy.load('en')\n",
      "\u001b[38;5;2m✔ Linking successful\u001b[0m\n",
      "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/es_core_news_sm\n",
      "-->\n",
      "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/spacy/data/es\n",
      "You can now load the model via spacy.load('es')\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "python -m spacy link en_core_web_sm en --force;\n",
    "python -m spacy link es_core_news_sm es --force;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Validate Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:29.969953Z",
     "start_time": "2020-06-20T17:06:29.434343Z"
    },
    "scrolled": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2K\u001b[38;5;2m✔ Loaded compatibility table\u001b[0m\n",
      "\u001b[1m\n",
      "====================== Installed models (spaCy v2.3.0) ======================\u001b[0m\n",
      "\u001b[38;5;4mℹ spaCy installation:\n",
      "/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/spacy\u001b[0m\n",
      "\n",
      "TYPE      NAME              MODEL             VERSION                            \n",
      "package   es-core-news-sm   es_core_news_sm   \u001b[38;5;2m2.3.0\u001b[0m   \u001b[38;5;2m✔\u001b[0m\n",
      "package   en-core-web-sm    en_core_web_sm    \u001b[38;5;2m2.3.0\u001b[0m   \u001b[38;5;2m✔\u001b[0m\n",
      "link      en                en_core_web_sm    \u001b[38;5;2m2.3.0\u001b[0m   \u001b[38;5;2m✔\u001b[0m\n",
      "link      es                es_core_news_sm   \u001b[38;5;2m2.3.0\u001b[0m   \u001b[38;5;2m✔\u001b[0m\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# validate installation\n",
    "!{sys.executable} -m spacy validate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Get Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:29.985993Z",
     "start_time": "2020-06-20T17:06:29.978182Z"
    }
   },
   "outputs": [],
   "source": [
    "DATA_DIR = Path('..', 'data')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-22T18:22:17.894231Z",
     "start_time": "2018-11-22T18:22:17.889858Z"
    }
   },
   "source": [
    "- [BBC Articles](http://mlg.ucd.ie/datasets/bbc.html), use raw text files ([download](http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip))\n",
    "    - Data already included in [data](../data) directory, just unzip before first-time use.\n",
    "- [TED2013](http://opus.nlpl.eu/TED2013.php), a parallel corpus of TED talk subtitles in 15 langugages (sample provided) in `results/TED` subfolder of this directory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## SpaCy Pipeline & Architecture"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### The Processing Pipeline\n",
    "\n",
    "When you call a spaCy model on a text, spaCy \n",
    "\n",
    "1) tokenizes the text to produce a `Doc` object. \n",
    "\n",
    "2) passes the `Doc` object through the processing pipeline that may be customized, and for the default models consists of\n",
    "- a tagger, \n",
    "- a parser and \n",
    "- an entity recognizer. \n",
    "\n",
    "Each pipeline component returns the processed Doc, which is then passed on to the next component.\n",
    "\n",
    "![Architecture](assets/pipeline.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Key Data Structures\n",
    "\n",
    "The central data structures in spaCy are the **Doc** and the **Vocab**. Text annotations are also designed to allow a single source of truth:\n",
    "\n",
    "- The **`Doc`** object owns the sequence of tokens and all their annotations. `Span` and `Token` are views that point into it. It is constructed by the `Tokenizer`, and then modified in place by the components of the pipeline. \n",
    "- The **`Vocab`** object owns a set of look-up tables that make common information available across documents. \n",
    "- The **`Language`** object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.\n",
    "\n",
    "![Architecture](assets/spaCy-architecture.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## SpaCy in Action"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### Create & Explore the Language Object"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once installed and linked, we can instantiate a spaCy language model and then call it on a document. As a result, spaCy produces a Doc object that tokenizes the text and processes it according to configurable pipeline components that by default consist of a tagger, a parser, and a named-entity recognizer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.485122Z",
     "start_time": "2020-06-20T17:06:29.992962Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "nlp = spacy.load('en') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.489019Z",
     "start_time": "2020-06-20T17:06:30.486017Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "spacy.lang.en.English"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(nlp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.503265Z",
     "start_time": "2020-06-20T17:06:30.489953Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'en'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.lang"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.518873Z",
     "start_time": "2020-06-20T17:06:30.504278Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1m\n",
      "=========================== Info about model 'en' ===========================\u001b[0m\n",
      "\n",
      "lang             en                            \n",
      "name             core_web_sm                   \n",
      "license          MIT                           \n",
      "author           Explosion                     \n",
      "url              https://explosion.ai          \n",
      "email            contact@explosion.ai          \n",
      "description      English multi-task CNN trained on OntoNotes. Assigns context-specific token vectors, POS tags, dependency parse and named entities.\n",
      "sources          [{'name': 'OntoNotes 5', 'url': 'https://catalog.ldc.upenn.edu/LDC2013T19', 'license': 'commercial (licensed by Explosion)'}]\n",
      "pipeline         ['tagger', 'parser', 'ner']   \n",
      "version          2.3.0                         \n",
      "spacy_version    >=2.3.0,<2.4.0                \n",
      "parent_package   spacy                         \n",
      "labels           {'tagger': ['$', \"''\", ',', '-LRB-', '-RRB-', '.', ':', 'ADD', 'AFX', 'CC', 'CD', 'DT', 'EX', 'FW', 'HYPH', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NFP', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', 'XX', '_SP', '``'], 'parser': ['ROOT', 'acl', 'acomp', 'advcl', 'advmod', 'agent', 'amod', 'appos', 'attr', 'aux', 'auxpass', 'case', 'cc', 'ccomp', 'compound', 'conj', 'csubj', 'csubjpass', 'dative', 'dep', 'det', 'dobj', 'expl', 'intj', 'mark', 'meta', 'neg', 'nmod', 'npadvmod', 'nsubj', 'nsubjpass', 'nummod', 'oprd', 'parataxis', 'pcomp', 'pobj', 'poss', 'preconj', 'predet', 'prep', 'prt', 'punct', 'quantmod', 'relcl', 'xcomp'], 'ner': ['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']}\n",
      "link             /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/spacy/data/en\n",
      "source           /home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/en_core_web_sm\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'lang': 'en',\n",
       " 'name': 'core_web_sm',\n",
       " 'license': 'MIT',\n",
       " 'author': 'Explosion',\n",
       " 'url': 'https://explosion.ai',\n",
       " 'email': 'contact@explosion.ai',\n",
       " 'description': 'English multi-task CNN trained on OntoNotes. Assigns context-specific token vectors, POS tags, dependency parse and named entities.',\n",
       " 'sources': [{'name': 'OntoNotes 5',\n",
       "   'url': 'https://catalog.ldc.upenn.edu/LDC2013T19',\n",
       "   'license': 'commercial (licensed by Explosion)'}],\n",
       " 'pipeline': ['tagger', 'parser', 'ner'],\n",
       " 'version': '2.3.0',\n",
       " 'spacy_version': '>=2.3.0,<2.4.0',\n",
       " 'parent_package': 'spacy',\n",
       " 'accuracy': {'las': 89.6760148148,\n",
       "  'uas': 91.5566789084,\n",
       "  'token_acc': 99.7576498212,\n",
       "  'las_per_type': {'compound': {'p': 90.4808743169,\n",
       "    'r': 92.2143016262,\n",
       "    'f': 91.339364519},\n",
       "   'root': {'p': 88.3934259141, 'r': 91.5122337268, 'f': 89.9257963125},\n",
       "   'advcl': {'p': 68.2166446499, 'r': 65.0214051876, 'f': 66.5807117071},\n",
       "   'prep': {'p': 85.668083186, 'r': 86.4286527298, 'f': 86.0466873159},\n",
       "   'det': {'p': 97.5984043636, 'r': 97.7575732866, 'f': 97.6779239826},\n",
       "   'pobj': {'p': 95.9272840758, 'r': 96.5525365164, 'f': 96.2388947595},\n",
       "   'agent': {'p': 88.2747068677, 'r': 94.4444444444, 'f': 91.2554112554},\n",
       "   'nsubjpass': {'p': 92.1215242019, 'r': 91.7435897436, 'f': 91.9321685509},\n",
       "   'auxpass': {'p': 94.8098434004, 'r': 96.5375854214, 'f': 95.6659142212},\n",
       "   'amod': {'p': 91.8095489317, 'r': 90.1976028507, 'f': 90.9964377921},\n",
       "   'npadvmod': {'p': 78.2401902497, 'r': 70.1243339254, 'f': 73.9602847508},\n",
       "   'nsubj': {'p': 95.6152006378, 'r': 94.5639318411, 'f': 95.086660647},\n",
       "   'pcomp': {'p': 84.1463414634, 'r': 86.974789916, 'f': 85.5371900826},\n",
       "   'cc': {'p': 83.2791621524, 'r': 82.7314039703, 'f': 83.0043793869},\n",
       "   'conj': {'p': 77.0646828899, 'r': 76.9288860919, 'f': 76.9967246158},\n",
       "   'ccomp': {'p': 79.3469860896, 'r': 83.6456211813, 'f': 81.4396192742},\n",
       "   'advmod': {'p': 85.6985418786, 'r': 85.0353297443, 'f': 85.3656476946},\n",
       "   'neg': {'p': 93.9830929886, 'r': 94.8319116909, 'f': 94.4055944056},\n",
       "   'poss': {'p': 97.1354166667, 'r': 97.6046698873, 'f': 97.3694779116},\n",
       "   'dobj': {'p': 92.4811439346, 'r': 93.7853557485, 'f': 93.1286838878},\n",
       "   'relcl': {'p': 76.1283497884, 'r': 78.2813633067, 'f': 77.1898462639},\n",
       "   'appos': {'p': 70.5693296602, 'r': 66.6811279826, 'f': 68.5701539148},\n",
       "   'mark': {'p': 90.7485503426, 'r': 91.2294647589, 'f': 90.988372093},\n",
       "   'aux': {'p': 97.805224809, 'r': 97.9793484066, 'f': 97.8922091782},\n",
       "   'acomp': {'p': 89.9175068744, 'r': 88.8989578614, 'f': 89.4053315106},\n",
       "   'case': {'p': 97.5345167653, 'r': 98.998998999, 'f': 98.26130154},\n",
       "   'prt': {'p': 80.987654321, 'r': 88.1720430108, 'f': 84.4272844273},\n",
       "   'nmod': {'p': 74.6825396825, 'r': 57.3430834857, 'f': 64.8741813168},\n",
       "   'xcomp': {'p': 89.1514500537, 'r': 89.3754486719, 'f': 89.2633088367},\n",
       "   'attr': {'p': 90.446904469, 'r': 92.7670311186, 'f': 91.592277351},\n",
       "   'nummod': {'p': 93.5037273695, 'r': 88.6868686869, 'f': 91.0316226024},\n",
       "   'oprd': {'p': 83.3910034602, 'r': 71.9402985075, 'f': 77.2435897436},\n",
       "   'acl': {'p': 74.2789876398, 'r': 68.8488816148, 'f': 71.4609286523},\n",
       "   'dep': {'p': 37.9310344828, 'r': 16.0714285714, 'f': 22.5769669327},\n",
       "   'quantmod': {'p': 87.8733031674, 'r': 78.878960195, 'f': 83.1335616438},\n",
       "   'expl': {'p': 98.2978723404, 'r': 98.9293361884, 'f': 98.6125933831},\n",
       "   'csubj': {'p': 75.1515151515, 'r': 73.3727810651, 'f': 74.251497006},\n",
       "   'preconj': {'p': 59.0909090909, 'r': 60.4651162791, 'f': 59.7701149425},\n",
       "   'dative': {'p': 77.9792746114, 'r': 69.0366972477, 'f': 73.2360097324},\n",
       "   'parataxis': {'p': 65.7980456026, 'r': 43.8177874187, 'f': 52.6041666667},\n",
       "   'predet': {'p': 85.3658536585, 'r': 90.1287553648, 'f': 87.6826722338},\n",
       "   'intj': {'p': 69.9828473413, 'r': 59.7802197802, 'f': 64.4804425128},\n",
       "   'meta': {'p': 81.8181818182, 'r': 34.615384615400004, 'f': 48.6486486486},\n",
       "   'csubjpass': {'p': 71.4285714286, 'r': 83.3333333333, 'f': 76.9230769231}},\n",
       "  'tags_acc': 97.0508922897,\n",
       "  'ents_f': 85.356163617,\n",
       "  'ents_p': 85.4613703849,\n",
       "  'ents_r': 85.2512155592,\n",
       "  'ents_per_type': {'ORG': {'p': 82.6970612469,\n",
       "    'r': 83.4072022161,\n",
       "    'f': 83.0506137085},\n",
       "   'PERSON': {'p': 86.6202346041, 'r': 92.5215348473, 'f': 89.4736842105},\n",
       "   'NORP': {'p': 90.0090009001, 'r': 88.9679715302, 'f': 89.485458613},\n",
       "   'GPE': {'p': 92.1481008618, 'r': 89.937694704, 'f': 91.029481318},\n",
       "   'CARDINAL': {'p': 83.9739413681, 'r': 86.8013468013, 'f': 85.3642384106},\n",
       "   'DATE': {'p': 84.981054082, 'r': 85.9881491809, 'f': 85.4816354816},\n",
       "   'TIME': {'p': 69.1729323308, 'r': 70.2290076336, 'f': 69.696969697},\n",
       "   'PERCENT': {'p': 91.6802610114, 'r': 87.8125, 'f': 89.7047086991},\n",
       "   'WORK_OF_ART': {'p': 45.5445544554, 'r': 34.328358209, 'f': 39.1489361702},\n",
       "   'QUANTITY': {'p': 79.6992481203, 'r': 66.6666666667, 'f': 72.602739726},\n",
       "   'LOC': {'p': 73.6625514403, 'r': 71.6, 'f': 72.61663286},\n",
       "   'FAC': {'p': 40.8602150538, 'r': 46.9135802469, 'f': 43.6781609195},\n",
       "   'MONEY': {'p': 91.7174177832, 'r': 90.5048076923, 'f': 91.1070780399},\n",
       "   'LANGUAGE': {'p': 75.0, 'r': 65.2173913043, 'f': 69.7674418605},\n",
       "   'ORDINAL': {'p': 80.0724637681, 'r': 84.3511450382, 'f': 82.156133829},\n",
       "   'EVENT': {'p': 64.4736842105, 'r': 36.2962962963, 'f': 46.4454976303},\n",
       "   'PRODUCT': {'p': 50.5263157895, 'r': 23.645320197, 'f': 32.2147651007},\n",
       "   'LAW': {'p': 59.0163934426, 'r': 60.0, 'f': 59.5041322314}}},\n",
       " 'speed': {'cpu': 7206.4588205683, 'gpu': None, 'nwords': 291314},\n",
       " 'labels': {'tagger': ['$',\n",
       "   \"''\",\n",
       "   ',',\n",
       "   '-LRB-',\n",
       "   '-RRB-',\n",
       "   '.',\n",
       "   ':',\n",
       "   'ADD',\n",
       "   'AFX',\n",
       "   'CC',\n",
       "   'CD',\n",
       "   'DT',\n",
       "   'EX',\n",
       "   'FW',\n",
       "   'HYPH',\n",
       "   'IN',\n",
       "   'JJ',\n",
       "   'JJR',\n",
       "   'JJS',\n",
       "   'LS',\n",
       "   'MD',\n",
       "   'NFP',\n",
       "   'NN',\n",
       "   'NNP',\n",
       "   'NNPS',\n",
       "   'NNS',\n",
       "   'PDT',\n",
       "   'POS',\n",
       "   'PRP',\n",
       "   'PRP$',\n",
       "   'RB',\n",
       "   'RBR',\n",
       "   'RBS',\n",
       "   'RP',\n",
       "   'SYM',\n",
       "   'TO',\n",
       "   'UH',\n",
       "   'VB',\n",
       "   'VBD',\n",
       "   'VBG',\n",
       "   'VBN',\n",
       "   'VBP',\n",
       "   'VBZ',\n",
       "   'WDT',\n",
       "   'WP',\n",
       "   'WP$',\n",
       "   'WRB',\n",
       "   'XX',\n",
       "   '_SP',\n",
       "   '``'],\n",
       "  'parser': ['ROOT',\n",
       "   'acl',\n",
       "   'acomp',\n",
       "   'advcl',\n",
       "   'advmod',\n",
       "   'agent',\n",
       "   'amod',\n",
       "   'appos',\n",
       "   'attr',\n",
       "   'aux',\n",
       "   'auxpass',\n",
       "   'case',\n",
       "   'cc',\n",
       "   'ccomp',\n",
       "   'compound',\n",
       "   'conj',\n",
       "   'csubj',\n",
       "   'csubjpass',\n",
       "   'dative',\n",
       "   'dep',\n",
       "   'det',\n",
       "   'dobj',\n",
       "   'expl',\n",
       "   'intj',\n",
       "   'mark',\n",
       "   'meta',\n",
       "   'neg',\n",
       "   'nmod',\n",
       "   'npadvmod',\n",
       "   'nsubj',\n",
       "   'nsubjpass',\n",
       "   'nummod',\n",
       "   'oprd',\n",
       "   'parataxis',\n",
       "   'pcomp',\n",
       "   'pobj',\n",
       "   'poss',\n",
       "   'preconj',\n",
       "   'predet',\n",
       "   'prep',\n",
       "   'prt',\n",
       "   'punct',\n",
       "   'quantmod',\n",
       "   'relcl',\n",
       "   'xcomp'],\n",
       "  'ner': ['CARDINAL',\n",
       "   'DATE',\n",
       "   'EVENT',\n",
       "   'FAC',\n",
       "   'GPE',\n",
       "   'LANGUAGE',\n",
       "   'LAW',\n",
       "   'LOC',\n",
       "   'MONEY',\n",
       "   'NORP',\n",
       "   'ORDINAL',\n",
       "   'ORG',\n",
       "   'PERCENT',\n",
       "   'PERSON',\n",
       "   'PRODUCT',\n",
       "   'QUANTITY',\n",
       "   'TIME',\n",
       "   'WORK_OF_ART']},\n",
       " 'link': '/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/spacy/data/en',\n",
       " 'source': '/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/en_core_web_sm'}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spacy.info('en')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.521944Z",
     "start_time": "2020-06-20T17:06:30.519802Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "def get_attributes(f):\n",
    "    print([a for a in dir(f) if not a.startswith('_')], end=' ')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.531848Z",
     "start_time": "2020-06-20T17:06:30.522865Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Defaults', 'add_pipe', 'begin_training', 'create_pipe', 'disable_pipes', 'entity', 'evaluate', 'factories', 'from_bytes', 'from_disk', 'get_pipe', 'has_pipe', 'lang', 'linker', 'make_doc', 'matcher', 'max_length', 'meta', 'parser', 'path', 'pipe', 'pipe_factories', 'pipe_labels', 'pipe_names', 'pipeline', 'preprocess_gold', 'rehearse', 'remove_pipe', 'rename_pipe', 'replace_pipe', 'resume_training', 'tagger', 'tensorizer', 'to_bytes', 'to_disk', 'tokenizer', 'update', 'use_params', 'vocab'] "
     ]
    }
   ],
   "source": [
    "get_attributes(nlp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Explore the Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s illustrate the pipeline using a simple sentence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.553495Z",
     "start_time": "2020-06-20T17:06:30.533274Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "sample_text = 'Apple is looking at buying U.K. startup for $1 billion'\n",
    "doc = nlp(sample_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.558302Z",
     "start_time": "2020-06-20T17:06:30.554487Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['cats', 'char_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'noun_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 'to_bytes', 'to_disk', 'to_json', 'to_utf8_array', 'user_data', 'user_hooks', 'user_span_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab'] "
     ]
    }
   ],
   "source": [
    "get_attributes(doc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.566161Z",
     "start_time": "2020-06-20T17:06:30.559809Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc.is_parsed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.574496Z",
     "start_time": "2020-06-20T17:06:30.567134Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc.is_sentenced"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.582779Z",
     "start_time": "2020-06-20T17:06:30.575463Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc.is_tagged"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.594192Z",
     "start_time": "2020-06-20T17:06:30.583760Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Apple is looking at buying U.K. startup for $1 billion'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc.text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.602553Z",
     "start_time": "2020-06-20T17:06:30.595237Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['add_flag', 'cfg', 'data_dir', 'from_bytes', 'from_disk', 'get_vector', 'has_vector', 'lang', 'length', 'lex_attr_getters', 'load_extra_lookups', 'lookups', 'lookups_extra', 'morphology', 'prune_vectors', 'reset_vectors', 'set_vector', 'strings', 'to_bytes', 'to_disk', 'vectors', 'vectors_length', 'writing_system'] "
     ]
    }
   ],
   "source": [
    "get_attributes(doc.vocab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.617277Z",
     "start_time": "2020-06-20T17:06:30.605280Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "487"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc.vocab.length"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Explore `Token` annotations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parsed document content is iterable and each element has numerous attributes produced by the processing pipeline. The below sample illustrates how to access the following attributes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.631082Z",
     "start_time": "2020-06-20T17:06:30.618677Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       Apple\n",
       "1          is\n",
       "2     looking\n",
       "3          at\n",
       "4      buying\n",
       "5        U.K.\n",
       "6     startup\n",
       "7         for\n",
       "8           $\n",
       "9           1\n",
       "10    billion\n",
       "dtype: object"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series([token.text for token in doc])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.651336Z",
     "start_time": "2020-06-20T17:06:30.632454Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>lemma</th>\n",
       "      <th>pos</th>\n",
       "      <th>tag</th>\n",
       "      <th>dep</th>\n",
       "      <th>shape</th>\n",
       "      <th>is_alpha</th>\n",
       "      <th>is_stop</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Apple</td>\n",
       "      <td>Apple</td>\n",
       "      <td>PROPN</td>\n",
       "      <td>NNP</td>\n",
       "      <td>nsubj</td>\n",
       "      <td>Xxxxx</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>is</td>\n",
       "      <td>be</td>\n",
       "      <td>AUX</td>\n",
       "      <td>VBZ</td>\n",
       "      <td>aux</td>\n",
       "      <td>xx</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>looking</td>\n",
       "      <td>look</td>\n",
       "      <td>VERB</td>\n",
       "      <td>VBG</td>\n",
       "      <td>ROOT</td>\n",
       "      <td>xxxx</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>at</td>\n",
       "      <td>at</td>\n",
       "      <td>ADP</td>\n",
       "      <td>IN</td>\n",
       "      <td>prep</td>\n",
       "      <td>xx</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>buying</td>\n",
       "      <td>buy</td>\n",
       "      <td>VERB</td>\n",
       "      <td>VBG</td>\n",
       "      <td>pcomp</td>\n",
       "      <td>xxxx</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>U.K.</td>\n",
       "      <td>U.K.</td>\n",
       "      <td>PROPN</td>\n",
       "      <td>NNP</td>\n",
       "      <td>compound</td>\n",
       "      <td>X.X.</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>startup</td>\n",
       "      <td>startup</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>NN</td>\n",
       "      <td>dobj</td>\n",
       "      <td>xxxx</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>for</td>\n",
       "      <td>for</td>\n",
       "      <td>ADP</td>\n",
       "      <td>IN</td>\n",
       "      <td>prep</td>\n",
       "      <td>xxx</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>$</td>\n",
       "      <td>$</td>\n",
       "      <td>SYM</td>\n",
       "      <td>$</td>\n",
       "      <td>quantmod</td>\n",
       "      <td>$</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>NUM</td>\n",
       "      <td>CD</td>\n",
       "      <td>compound</td>\n",
       "      <td>d</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>billion</td>\n",
       "      <td>billion</td>\n",
       "      <td>NUM</td>\n",
       "      <td>CD</td>\n",
       "      <td>pobj</td>\n",
       "      <td>xxxx</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       text    lemma    pos  tag       dep  shape  is_alpha  is_stop\n",
       "0     Apple    Apple  PROPN  NNP     nsubj  Xxxxx      True    False\n",
       "1        is       be    AUX  VBZ       aux     xx      True     True\n",
       "2   looking     look   VERB  VBG      ROOT   xxxx      True    False\n",
       "3        at       at    ADP   IN      prep     xx      True     True\n",
       "4    buying      buy   VERB  VBG     pcomp   xxxx      True    False\n",
       "5      U.K.     U.K.  PROPN  NNP  compound   X.X.     False    False\n",
       "6   startup  startup   NOUN   NN      dobj   xxxx      True    False\n",
       "7       for      for    ADP   IN      prep    xxx      True     True\n",
       "8         $        $    SYM    $  quantmod      $     False    False\n",
       "9         1        1    NUM   CD  compound      d     False    False\n",
       "10  billion  billion    NUM   CD      pobj   xxxx      True    False"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame([[t.text, t.lemma_, t.pos_, t.tag_, t.dep_, t.shape_, t.is_alpha, t.is_stop]\n",
    "              for t in doc],\n",
    "             columns=['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Visualize POS Dependencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can visualize the syntactic dependency in a browser or notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.658841Z",
     "start_time": "2020-06-20T17:06:30.652276Z"
    }
   },
   "outputs": [],
   "source": [
    "options = {'compact': True, 'bg': 'white',\n",
    "           'color': 'black', 'font': 'Source Sans Pro', 'notebook': True}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.668100Z",
     "start_time": "2020-06-20T17:06:30.659993Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"29c6018396ee430e914cb5ba528da877-0\" class=\"displacy\" width=\"1700\" height=\"362.0\" direction=\"ltr\" style=\"max-width: none; height: 362.0px; color: black; background: white; font-family: Source Sans Pro; direction: ltr\">\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Apple</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">PROPN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"200\">is</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"200\">AUX</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"350\">looking</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"350\">VERB</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"500\">at</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"500\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"650\">buying</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"650\">VERB</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"800\">U.K.</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"800\">PROPN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"950\">startup</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"950\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">for</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1250\">$</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1250\">SYM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1400\">1</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1400\">NUM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"272.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1550\">billion</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1550\">NUM</tspan>\n",
       "</text>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-0\" stroke-width=\"2px\" d=\"M62,227.0 62,177.0 347.0,177.0 347.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M62,229.0 L58,221.0 66,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-1\" stroke-width=\"2px\" d=\"M212,227.0 212,202.0 344.0,202.0 344.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">aux</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M212,229.0 L208,221.0 216,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-2\" stroke-width=\"2px\" d=\"M362,227.0 362,202.0 494.0,202.0 494.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M494.0,229.0 L498.0,221.0 490.0,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-3\" stroke-width=\"2px\" d=\"M512,227.0 512,202.0 644.0,202.0 644.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pcomp</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M644.0,229.0 L648.0,221.0 640.0,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-4\" stroke-width=\"2px\" d=\"M812,227.0 812,202.0 944.0,202.0 944.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M812,229.0 L808,221.0 816,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-5\" stroke-width=\"2px\" d=\"M662,227.0 662,177.0 947.0,177.0 947.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M947.0,229.0 L951.0,221.0 943.0,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-6\" stroke-width=\"2px\" d=\"M962,227.0 962,202.0 1094.0,202.0 1094.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1094.0,229.0 L1098.0,221.0 1090.0,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-7\" stroke-width=\"2px\" d=\"M1262,227.0 1262,177.0 1547.0,177.0 1547.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-7\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">quantmod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1262,229.0 L1258,221.0 1266,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-8\" stroke-width=\"2px\" d=\"M1412,227.0 1412,202.0 1544.0,202.0 1544.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-8\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1412,229.0 L1408,221.0 1416,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-29c6018396ee430e914cb5ba528da877-0-9\" stroke-width=\"2px\" d=\"M1112,227.0 1112,152.0 1550.0,152.0 1550.0,227.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-29c6018396ee430e914cb5ba528da877-0-9\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1550.0,229.0 L1554.0,221.0 1546.0,221.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "</svg></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "displacy.render(doc, style='dep', options=options)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Visualize Named Entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.681359Z",
     "start_time": "2020-06-20T17:06:30.669391Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Apple\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " is looking at buying \n",
       "<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    U.K.\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
       "</mark>\n",
       " startup for \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    $1 billion\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
       "</mark>\n",
       "</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "displacy.render(doc, style='ent', jupyter=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Read BBC Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now read a larger set of 2,225 BBC News articles (see GitHub for data source details) that belong to five categories and are stored in individual text files. We \n",
    "- call the .glob() method of the pathlib’s Path object, \n",
    "- iterate over the resulting list of paths, \n",
    "- read all lines of the news article excluding the heading in the first line, and \n",
    "- append the cleaned result to a list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.750226Z",
     "start_time": "2020-06-20T17:06:30.682511Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "files = (DATA_DIR / 'bbc').glob('**/*.txt')\n",
    "bbc_articles = []\n",
    "for i, file in enumerate(sorted(list(files))):\n",
    "    with file.open(encoding='latin1') as f:\n",
    "        lines = f.readlines()\n",
    "        body = ' '.join([l.strip() for l in lines[1:]]).strip()\n",
    "        bbc_articles.append(body)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.753654Z",
     "start_time": "2020-06-20T17:06:30.751203Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2225"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(bbc_articles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.761614Z",
     "start_time": "2020-06-20T17:06:30.754578Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (Â£600m) for the three months to December, from $639m year-earlier.  The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.  Time Warner said on Friday that it now owns 8% of search-engine Google. But its own internet business, AOL, had has mixed fortunes. It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters. However, the company said AOL\\'s underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues. It hopes to increase subscribers by offering the online service free to TimeWarner internet customers and will try to sign up AOL\\'s existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.  Time Warner\\'s fourth quarter profits were slightly better than analysts\\' expectations. But its film division saw profits slump 27% to $284m, helped by box-office flops Alexander and Catwoman, a sharp contrast to year-earlier, when the third and final film in the Lord of the Rings trilogy boosted results. For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn. \"Our financial performance was strong, meeting or exceeding all of our full-year objectives and greatly enhancing our flexibility,\" chairman and chief executive Richard Parsons said. For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.  TimeWarner is to restate its accounts as part of efforts to resolve an inquiry into AOL by US market regulators. It has already offered to pay $300m to settle charges, in a deal that is under review by the SEC. The company said it was unable to estimate the amount it needed to set aside for legal reserves, which it previously set at $500m. It intends to adjust the way it accounts for a deal with German music publisher Bertelsmann\\'s purchase of a stake in AOL Europe, which it had reported as advertising revenue. It will now book the sale of its stake in AOL Europe as a loss on the value of that stake.'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bbc_articles[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Parse first article through Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.769464Z",
     "start_time": "2020-06-20T17:06:30.762431Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['tagger', 'parser', 'ner']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.pipe_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.841739Z",
     "start_time": "2020-06-20T17:06:30.770222Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "spacy.tokens.doc.Doc"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = nlp(bbc_articles[0])\n",
    "type(doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Detect sentence boundary\n",
    "Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries. \n",
    "\n",
    "Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "spaCy computes sentence boundaries from the syntactic parse tree so that punctuation and capitalization play an important but not decisive role. As a result, boundaries will coincide with clause boundaries, even for poorly punctuated text.\n",
    "\n",
    "We can access the parsed sentences using the .sents attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.845300Z",
     "start_time": "2020-06-20T17:06:30.842621Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (Â£600m) for the three months to December, from $639m year-earlier.  ,\n",
       " The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales.,\n",
       " TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sentences = [s for s in doc.sents]\n",
    "sentences[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.853148Z",
     "start_time": "2020-06-20T17:06:30.846298Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['as_doc', 'char_span', 'conjuncts', 'doc', 'end', 'end_char', 'ent_id', 'ent_id_', 'ents', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'kb_id', 'kb_id_', 'label', 'label_', 'lefts', 'lemma_', 'lower_', 'merge', 'n_lefts', 'n_rights', 'noun_chunks', 'orth_', 'remove_extension', 'rights', 'root', 'sent', 'sentiment', 'set_extension', 'similarity', 'start', 'start_char', 'string', 'subtree', 'tensor', 'text', 'text_with_ws', 'to_array', 'upper_', 'vector', 'vector_norm', 'vocab'] "
     ]
    }
   ],
   "source": [
    "get_attributes(sentences[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.865641Z",
     "start_time": "2020-06-20T17:06:30.854111Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Token</th>\n",
       "      <th>POS Tag</th>\n",
       "      <th>Meaning</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Quarterly</td>\n",
       "      <td>ADJ</td>\n",
       "      <td>adjective</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>profits</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>at</td>\n",
       "      <td>ADP</td>\n",
       "      <td>adposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>US</td>\n",
       "      <td>PROPN</td>\n",
       "      <td>proper noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>media</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>giant</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>TimeWarner</td>\n",
       "      <td>PROPN</td>\n",
       "      <td>proper noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>jumped</td>\n",
       "      <td>VERB</td>\n",
       "      <td>verb</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>76</td>\n",
       "      <td>NUM</td>\n",
       "      <td>numeral</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>%</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>to</td>\n",
       "      <td>ADP</td>\n",
       "      <td>adposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>$</td>\n",
       "      <td>SYM</td>\n",
       "      <td>symbol</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>1.13bn</td>\n",
       "      <td>NUM</td>\n",
       "      <td>numeral</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>(</td>\n",
       "      <td>PUNCT</td>\n",
       "      <td>punctuation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Â£600</td>\n",
       "      <td>X</td>\n",
       "      <td>other</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Token POS Tag      Meaning\n",
       "0    Quarterly     ADJ    adjective\n",
       "1      profits    NOUN         noun\n",
       "2           at     ADP   adposition\n",
       "3           US   PROPN  proper noun\n",
       "4        media    NOUN         noun\n",
       "5        giant    NOUN         noun\n",
       "6   TimeWarner   PROPN  proper noun\n",
       "7       jumped    VERB         verb\n",
       "8           76     NUM      numeral\n",
       "9            %    NOUN         noun\n",
       "10          to     ADP   adposition\n",
       "11           $     SYM       symbol\n",
       "12      1.13bn     NUM      numeral\n",
       "13           (   PUNCT  punctuation\n",
       "14       Â£600       X        other"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame([[t.text, t.pos_, spacy.explain(t.pos_)] for t in sentences[0]], \n",
    "             columns=['Token', 'POS Tag', 'Meaning']).head(15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.875996Z",
     "start_time": "2020-06-20T17:06:30.867700Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"8b67eab871a14198a576634e24b7dd0b-0\" class=\"displacy\" width=\"4100\" height=\"737.0\" direction=\"ltr\" style=\"max-width: none; height: 737.0px; color: white; background: #09a3d5; font-family: Source Sans Pro; direction: ltr\">\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Quarterly</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">ADJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"200\">profits</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"200\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"350\">at</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"350\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"500\">US</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"500\">PROPN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"650\">media</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"650\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"800\">giant</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"800\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"950\">TimeWarner</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"950\">PROPN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">jumped</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">VERB</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1250\">76%</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1250\">NUM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1400\">to</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1400\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1550\">$</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1550\">SYM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1700\">1.13bn (</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1700\">NUM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1850\">Â£600</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1850\">X</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2000\">m)</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2000\">PROPN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2150\">for</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2150\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2300\">the</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2300\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2450\">three</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2450\">NUM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2600\">months</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2600\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2750\">to</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2750\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2900\">December,</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2900\">PROPN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3050\">from</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3050\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3200\">$</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3200\">SYM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3350\">639</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3350\">NUM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3500\">m</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3500\">NUM</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3650\">year-</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3650\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3800\">earlier.</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3800\">ADV</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"647.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3950\"> </tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3950\">SPACE</tspan>\n",
       "</text>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-0\" stroke-width=\"2px\" d=\"M62,602.0 62,577.0 179.0,577.0 179.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M62,604.0 L58,596.0 66,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-1\" stroke-width=\"2px\" d=\"M212,602.0 212,477.0 1091.0,477.0 1091.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M212,604.0 L208,596.0 216,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-2\" stroke-width=\"2px\" d=\"M212,602.0 212,577.0 329.0,577.0 329.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M329.0,604.0 L333.0,596.0 325.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-3\" stroke-width=\"2px\" d=\"M512,602.0 512,527.0 935.0,527.0 935.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M512,604.0 L508,596.0 516,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-4\" stroke-width=\"2px\" d=\"M662,602.0 662,577.0 779.0,577.0 779.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M662,604.0 L658,596.0 666,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-5\" stroke-width=\"2px\" d=\"M812,602.0 812,577.0 929.0,577.0 929.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M812,604.0 L808,596.0 816,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-6\" stroke-width=\"2px\" d=\"M362,602.0 362,502.0 938.0,502.0 938.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M938.0,604.0 L942.0,596.0 934.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-7\" stroke-width=\"2px\" d=\"M1112,602.0 1112,577.0 1229.0,577.0 1229.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-7\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">npadvmod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1229.0,604.0 L1233.0,596.0 1225.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-8\" stroke-width=\"2px\" d=\"M1112,602.0 1112,552.0 1382.0,552.0 1382.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-8\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1382.0,604.0 L1386.0,596.0 1378.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-9\" stroke-width=\"2px\" d=\"M1562,602.0 1562,577.0 1679.0,577.0 1679.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-9\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nmod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1562,604.0 L1558,596.0 1566,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-10\" stroke-width=\"2px\" d=\"M1412,602.0 1412,552.0 1682.0,552.0 1682.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-10\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1682.0,604.0 L1686.0,596.0 1678.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-11\" stroke-width=\"2px\" d=\"M1862,602.0 1862,577.0 1979.0,577.0 1979.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-11\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1862,604.0 L1858,596.0 1866,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-12\" stroke-width=\"2px\" d=\"M1712,602.0 1712,552.0 1982.0,552.0 1982.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-12\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">appos</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1982.0,604.0 L1986.0,596.0 1978.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-13\" stroke-width=\"2px\" d=\"M1112,602.0 1112,452.0 2144.0,452.0 2144.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-13\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2144.0,604.0 L2148.0,596.0 2140.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-14\" stroke-width=\"2px\" d=\"M2312,602.0 2312,552.0 2582.0,552.0 2582.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-14\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2312,604.0 L2308,596.0 2316,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-15\" stroke-width=\"2px\" d=\"M2462,602.0 2462,577.0 2579.0,577.0 2579.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-15\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nummod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2462,604.0 L2458,596.0 2466,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-16\" stroke-width=\"2px\" d=\"M2162,602.0 2162,527.0 2585.0,527.0 2585.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-16\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2585.0,604.0 L2589.0,596.0 2581.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-17\" stroke-width=\"2px\" d=\"M2612,602.0 2612,577.0 2729.0,577.0 2729.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-17\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2729.0,604.0 L2733.0,596.0 2725.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-18\" stroke-width=\"2px\" d=\"M2762,602.0 2762,577.0 2879.0,577.0 2879.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-18\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2879.0,604.0 L2883.0,596.0 2875.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-19\" stroke-width=\"2px\" d=\"M1112,602.0 1112,427.0 3047.0,427.0 3047.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-19\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M3047.0,604.0 L3051.0,596.0 3043.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-20\" stroke-width=\"2px\" d=\"M3212,602.0 3212,577.0 3329.0,577.0 3329.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-20\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nmod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M3212,604.0 L3208,596.0 3216,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-21\" stroke-width=\"2px\" d=\"M3362,602.0 3362,577.0 3479.0,577.0 3479.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-21\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M3362,604.0 L3358,596.0 3366,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-22\" stroke-width=\"2px\" d=\"M3512,602.0 3512,552.0 3782.0,552.0 3782.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-22\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">npadvmod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M3512,604.0 L3508,596.0 3516,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-23\" stroke-width=\"2px\" d=\"M3662,602.0 3662,577.0 3779.0,577.0 3779.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-23\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">npadvmod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M3662,604.0 L3658,596.0 3666,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-24\" stroke-width=\"2px\" d=\"M1112,602.0 1112,402.0 3800.0,402.0 3800.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-24\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">punct</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M3800.0,604.0 L3804.0,596.0 3796.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-8b67eab871a14198a576634e24b7dd0b-0-25\" stroke-width=\"2px\" d=\"M3812,602.0 3812,577.0 3929.0,577.0 3929.0,602.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-8b67eab871a14198a576634e24b7dd0b-0-25\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\"></textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M3929.0,604.0 L3933.0,596.0 3925.0,596.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "</svg></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "options = {'compact': True, 'bg': '#09a3d5',\n",
    "           'color': 'white', 'font': 'Source Sans Pro'}\n",
    "displacy.render(sentences[0].as_doc(), style='dep', jupyter=True, options=options)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.884204Z",
     "start_time": "2020-06-20T17:06:30.877280Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Quarterly | DATE | Absolute or relative dates or periods\n",
      "US | GPE | Countries, cities, states\n",
      "TimeWarner | ORG | Companies, agencies, institutions, etc.\n",
      "76 | PERCENT | Percentage, including \"%\"\n",
      "% | PERCENT | Percentage, including \"%\"\n",
      "1.13bn | MONEY | Monetary values, including unit\n",
      "the | DATE | Absolute or relative dates or periods\n",
      "three | DATE | Absolute or relative dates or periods\n",
      "months | DATE | Absolute or relative dates or periods\n",
      "to | DATE | Absolute or relative dates or periods\n",
      "December | DATE | Absolute or relative dates or periods\n",
      "639 | MONEY | Monetary values, including unit\n",
      "year | DATE | Absolute or relative dates or periods\n",
      "- | DATE | Absolute or relative dates or periods\n",
      "earlier | DATE | Absolute or relative dates or periods\n"
     ]
    }
   ],
   "source": [
    "for t in sentences[0]:\n",
    "    if t.ent_type_:\n",
    "        print('{} | {} | {}'.format(t.text, t.ent_type_, spacy.explain(t.ent_type_)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.892187Z",
     "start_time": "2020-06-20T17:06:30.885297Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
       "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Quarterly\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
       "</mark>\n",
       " profits at \n",
       "<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    US\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
       "</mark>\n",
       " media giant \n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    TimeWarner\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " jumped \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    76%\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERCENT</span>\n",
       "</mark>\n",
       " to $\n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    1.13bn\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
       "</mark>\n",
       " (Â£600m) for \n",
       "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    the three months to December\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
       "</mark>\n",
       ", from $\n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    639\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
       "</mark>\n",
       "m \n",
       "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    year-earlier\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
       "</mark>\n",
       ".  </div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "displacy.render(sentences[0].as_doc(), style='ent', jupyter=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Named Entity-Recognition with textacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "spaCy enables named entity recognition using the .ent_type_ attribute:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Textacy makes access to the named entities that appear in the first article easy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.901981Z",
     "start_time": "2020-06-20T17:06:30.893292Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TimeWarner        7\n",
       "AOL               5\n",
       "fourth quarter    3\n",
       "8%                2\n",
       "AOL Europe        2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entities = [e.text for e in entities(doc)]\n",
    "pd.Series(entities).value_counts().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### N-Grams with textacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "N-grams combine N consecutive tokens. This can be useful for the bag-of-words model because, depending on the textual context, treating, e.g, ‘data scientist’ as a single token may be more meaningful than the two distinct tokens ‘data’ and ‘scientist’.\n",
    "\n",
    "Textacy makes it easy to view the ngrams of a given length n occurring with at least min_freq times:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:06:30.913506Z",
     "start_time": "2020-06-20T17:06:30.902951Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "fourth quarter     3\n",
       "Time Warner        2\n",
       "quarter profits    2\n",
       "AOL Europe         2\n",
       "company said       2\n",
       "dtype: int64"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series([n.text for n in ngrams(doc, n=2, min_freq=2)]).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### The spaCy streaming Pipeline API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To pass a larger number of documents through the processing pipeline, we can use spaCy’s streaming API as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:07:30.337503Z",
     "start_time": "2020-06-20T17:06:30.915060Z"
    },
    "scrolled": true,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 2100 2200 "
     ]
    }
   ],
   "source": [
    "iter_texts = (bbc_articles[i] for i in range(len(bbc_articles)))\n",
    "for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50, n_threads=8)):\n",
    "    if i % 100 == 0:\n",
    "        print(i, end = ' ')\n",
    "    assert doc.is_parsed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Multi-language Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "spaCy includes trained language models for English, German, Spanish, Portuguese, French, Italian and Dutch, as well as a multi-language model for named-entity recognition. Cross-language usage is straightforward since the API does not change.\n",
    "\n",
    "We will illustrate the Spanish language model using a parallel corpus of TED talk subtitles. For this purpose, we instantiate both language models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "#### Create a Spanish Language Object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:07:31.594645Z",
     "start_time": "2020-06-20T17:07:30.338771Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "model = {}\n",
    "for language in ['en', 'es']:\n",
    "    model[language] = spacy.load(language) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Read bilingual TED2013 samples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:07:31.600010Z",
     "start_time": "2020-06-20T17:07:31.595637Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "text = {}\n",
    "path = Path('data', 'TED')\n",
    "for language in ['en', 'es']:\n",
    "    file_name = path /  'TED2013_sample.{}'.format(language)\n",
    "    text[language] = file_name.read_text()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Sentence Boundaries English vs Spanish"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:07:31.708542Z",
     "start_time": "2020-06-20T17:07:31.601060Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sentences: en 21\n",
      "Sentences: es 22\n"
     ]
    }
   ],
   "source": [
    "parsed, sentences = {}, {}\n",
    "for language in ['en', 'es']:\n",
    "    parsed[language] = model[language](text[language])\n",
    "    sentences[language] = list(parsed[language].sents)\n",
    "    print('Sentences:', language, len(sentences[language]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:07:31.713024Z",
     "start_time": "2020-06-20T17:07:31.709584Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      " 1\n",
      "English:\t There's a tight and surprising link between the ocean's health and ours, says marine biologist Stephen Palumbi.\n",
      "Spanish:\t Existe una estrecha y sorprendente relación entre nuestra salud y la salud del océano, dice el biologo marino Stephen Palumbi.\n",
      "\n",
      " 2\n",
      "English:\t He shows how toxins at the bottom of the ocean food chain find their way into our bodies, with a shocking story of toxic contamination from a Japanese fish market.\n",
      "Spanish:\t Nos muestra, através de una impactante historia acerca de la contaminación tóxica en el mercado pesquero japonés, como las toxinas de la cadena alimenticia del fondo oceánico llegan a nuestro cuerpo.\n",
      "\n",
      " 3\n",
      "English:\t His work points a way forward for saving the oceans' health -- and humanity's.\n",
      "Spanish:\t fish,health,mission blue,oceans,science 899 Stephen Palumbi: Siguiendo el camino del mercurio.\n",
      "\n",
      " 4\n",
      "English:\t fish,health,mission blue,oceans,science 899 Stephen Palumbi:\n",
      "Spanish:\t El océano puede ser una cosa muy complicada.\n",
      "\n",
      " 5\n",
      "English:\t Following the mercury trail It can be a very complicated thing, the ocean.\n",
      "Spanish:\t Y podria ser una cosa muy complicada lo que la salud humana es.\n",
      "\n",
      " 6\n",
      "English:\t And it can be a very complicated thing, what human health is.\n",
      "Spanish:\t Y unirlas, podría ser una tarea desalentadora.\n"
     ]
    }
   ],
   "source": [
    "for i, (en, es) in enumerate(zip(sentences['en'], sentences['es']), 1):\n",
    "    print('\\n', i)\n",
    "    print('English:\\t', en)\n",
    "    print('Spanish:\\t', es)\n",
    "    if i > 5: \n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### POS Tagging English vs Spanish"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:07:31.728776Z",
     "start_time": "2020-06-20T17:07:31.715278Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "pos = {}\n",
    "for language in ['en', 'es']:\n",
    "    pos[language] = pd.DataFrame([[t.text, t.pos_, spacy.explain(t.pos_)] for t in sentences[language][0]],\n",
    "                                 columns=['Token', 'POS Tag', 'Meaning'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:07:31.748749Z",
     "start_time": "2020-06-20T17:07:31.729789Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Token</th>\n",
       "      <th>POS Tag</th>\n",
       "      <th>Meaning</th>\n",
       "      <th>Token</th>\n",
       "      <th>POS Tag</th>\n",
       "      <th>Meaning</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>There</td>\n",
       "      <td>PRON</td>\n",
       "      <td>pronoun</td>\n",
       "      <td>Existe</td>\n",
       "      <td>VERB</td>\n",
       "      <td>verb</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>'s</td>\n",
       "      <td>AUX</td>\n",
       "      <td>auxiliary</td>\n",
       "      <td>una</td>\n",
       "      <td>DET</td>\n",
       "      <td>determiner</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>a</td>\n",
       "      <td>DET</td>\n",
       "      <td>determiner</td>\n",
       "      <td>estrecha</td>\n",
       "      <td>ADJ</td>\n",
       "      <td>adjective</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>tight</td>\n",
       "      <td>ADJ</td>\n",
       "      <td>adjective</td>\n",
       "      <td>y</td>\n",
       "      <td>CCONJ</td>\n",
       "      <td>coordinating conjunction</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>and</td>\n",
       "      <td>CCONJ</td>\n",
       "      <td>coordinating conjunction</td>\n",
       "      <td>sorprendente</td>\n",
       "      <td>ADJ</td>\n",
       "      <td>adjective</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>surprising</td>\n",
       "      <td>ADJ</td>\n",
       "      <td>adjective</td>\n",
       "      <td>relación</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>link</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "      <td>entre</td>\n",
       "      <td>ADP</td>\n",
       "      <td>adposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>between</td>\n",
       "      <td>ADP</td>\n",
       "      <td>adposition</td>\n",
       "      <td>nuestra</td>\n",
       "      <td>DET</td>\n",
       "      <td>determiner</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>the</td>\n",
       "      <td>DET</td>\n",
       "      <td>determiner</td>\n",
       "      <td>salud</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>ocean</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "      <td>y</td>\n",
       "      <td>CCONJ</td>\n",
       "      <td>coordinating conjunction</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>'s</td>\n",
       "      <td>PART</td>\n",
       "      <td>particle</td>\n",
       "      <td>la</td>\n",
       "      <td>DET</td>\n",
       "      <td>determiner</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>health</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "      <td>salud</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>and</td>\n",
       "      <td>CCONJ</td>\n",
       "      <td>coordinating conjunction</td>\n",
       "      <td>del</td>\n",
       "      <td>ADP</td>\n",
       "      <td>adposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>ours</td>\n",
       "      <td>PRON</td>\n",
       "      <td>pronoun</td>\n",
       "      <td>océano</td>\n",
       "      <td>NOUN</td>\n",
       "      <td>noun</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>,</td>\n",
       "      <td>PUNCT</td>\n",
       "      <td>punctuation</td>\n",
       "      <td>,</td>\n",
       "      <td>PUNCT</td>\n",
       "      <td>punctuation</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Token POS Tag                   Meaning         Token POS Tag  \\\n",
       "0        There    PRON                   pronoun        Existe    VERB   \n",
       "1           's     AUX                 auxiliary           una     DET   \n",
       "2            a     DET                determiner      estrecha     ADJ   \n",
       "3        tight     ADJ                 adjective             y   CCONJ   \n",
       "4          and   CCONJ  coordinating conjunction  sorprendente     ADJ   \n",
       "5   surprising     ADJ                 adjective      relación    NOUN   \n",
       "6         link    NOUN                      noun         entre     ADP   \n",
       "7      between     ADP                adposition       nuestra     DET   \n",
       "8          the     DET                determiner         salud    NOUN   \n",
       "9        ocean    NOUN                      noun             y   CCONJ   \n",
       "10          's    PART                  particle            la     DET   \n",
       "11      health    NOUN                      noun         salud    NOUN   \n",
       "12         and   CCONJ  coordinating conjunction           del     ADP   \n",
       "13        ours    PRON                   pronoun        océano    NOUN   \n",
       "14           ,   PUNCT               punctuation             ,   PUNCT   \n",
       "\n",
       "                     Meaning  \n",
       "0                       verb  \n",
       "1                 determiner  \n",
       "2                  adjective  \n",
       "3   coordinating conjunction  \n",
       "4                  adjective  \n",
       "5                       noun  \n",
       "6                 adposition  \n",
       "7                 determiner  \n",
       "8                       noun  \n",
       "9   coordinating conjunction  \n",
       "10                determiner  \n",
       "11                      noun  \n",
       "12                adposition  \n",
       "13                      noun  \n",
       "14               punctuation  "
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bilingual_parsed = pd.concat([pos['en'], pos['es']], axis=1)\n",
    "bilingual_parsed.head(15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:07:31.758901Z",
     "start_time": "2020-06-20T17:07:31.749990Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"es\" id=\"3e38c993bfd44fc68678ebb98991f3ef-0\" class=\"displacy\" width=\"3050\" height=\"587.0\" direction=\"ltr\" style=\"max-width: none; height: 587.0px; color: white; background: #09a3d5; font-family: Source Sans Pro; direction: ltr\">\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Existe</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">VERB</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"200\">una</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"200\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"350\">estrecha</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"350\">ADJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"500\">y</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"500\">CCONJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"650\">sorprendente</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"650\">ADJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"800\">relación</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"800\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"950\">entre</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"950\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">nuestra</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1250\">salud</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1250\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1400\">y</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1400\">CCONJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1550\">la</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1550\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1700\">salud</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1700\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1850\">del</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1850\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2000\">océano,</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2000\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2150\">dice</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2150\">VERB</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2300\">el</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2300\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2450\">biologo</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2450\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2600\">marino</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2600\">ADJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2750\">Stephen</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2750\">INTJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"497.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2900\">Palumbi.</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2900\">INTJ</tspan>\n",
       "</text>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-0\" stroke-width=\"2px\" d=\"M62,452.0 62,302.0 2150.0,302.0 2150.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">ccomp</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M62,454.0 L58,446.0 66,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-1\" stroke-width=\"2px\" d=\"M212,452.0 212,352.0 794.0,352.0 794.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M212,454.0 L208,446.0 216,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-2\" stroke-width=\"2px\" d=\"M362,452.0 362,377.0 791.0,377.0 791.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M362,454.0 L358,446.0 366,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-3\" stroke-width=\"2px\" d=\"M512,452.0 512,427.0 635.0,427.0 635.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">cc</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M512,454.0 L508,446.0 516,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-4\" stroke-width=\"2px\" d=\"M362,452.0 362,402.0 638.0,402.0 638.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">conj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M638.0,454.0 L642.0,446.0 634.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-5\" stroke-width=\"2px\" d=\"M62,452.0 62,327.0 797.0,327.0 797.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M797.0,454.0 L801.0,446.0 793.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-6\" stroke-width=\"2px\" d=\"M962,452.0 962,402.0 1238.0,402.0 1238.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">case</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M962,454.0 L958,446.0 966,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-7\" stroke-width=\"2px\" d=\"M1112,452.0 1112,427.0 1235.0,427.0 1235.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-7\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1112,454.0 L1108,446.0 1116,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-8\" stroke-width=\"2px\" d=\"M812,452.0 812,377.0 1241.0,377.0 1241.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-8\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nmod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1241.0,454.0 L1245.0,446.0 1237.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-9\" stroke-width=\"2px\" d=\"M1412,452.0 1412,402.0 1688.0,402.0 1688.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-9\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">cc</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1412,454.0 L1408,446.0 1416,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-10\" stroke-width=\"2px\" d=\"M1562,452.0 1562,427.0 1685.0,427.0 1685.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-10\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1562,454.0 L1558,446.0 1566,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-11\" stroke-width=\"2px\" d=\"M1262,452.0 1262,377.0 1691.0,377.0 1691.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-11\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">conj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1691.0,454.0 L1695.0,446.0 1687.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-12\" stroke-width=\"2px\" d=\"M1862,452.0 1862,427.0 1985.0,427.0 1985.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-12\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">case</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1862,454.0 L1858,446.0 1866,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-13\" stroke-width=\"2px\" d=\"M1712,452.0 1712,402.0 1988.0,402.0 1988.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-13\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nmod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1988.0,454.0 L1992.0,446.0 1984.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-14\" stroke-width=\"2px\" d=\"M2312,452.0 2312,427.0 2435.0,427.0 2435.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-14\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2312,454.0 L2308,446.0 2316,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-15\" stroke-width=\"2px\" d=\"M2162,452.0 2162,402.0 2438.0,402.0 2438.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-15\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2438.0,454.0 L2442.0,446.0 2434.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-16\" stroke-width=\"2px\" d=\"M2462,452.0 2462,427.0 2585.0,427.0 2585.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-16\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2585.0,454.0 L2589.0,446.0 2581.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-17\" stroke-width=\"2px\" d=\"M2462,452.0 2462,402.0 2738.0,402.0 2738.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-17\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">appos</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2738.0,454.0 L2742.0,446.0 2734.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-3e38c993bfd44fc68678ebb98991f3ef-0-18\" stroke-width=\"2px\" d=\"M2762,452.0 2762,427.0 2885.0,427.0 2885.0,452.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-3e38c993bfd44fc68678ebb98991f3ef-0-18\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">flat</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2885.0,454.0 L2889.0,446.0 2881.0,446.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "</svg></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "displacy.render(sentences['es'][0].as_doc(), style='dep', jupyter=True, options=options)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "355px",
    "left": "1936.95px",
    "top": "146.953px",
    "width": "305px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
